Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


of a sequence for detecting the presence of the known binding sites of interest. PWM files

of known motifs can be downloaded from a motif database such as JASPAR at

“https://jaspar.genereg.net/”. We will use MAST, which is one of the MEME Suite programs, to search for

known motifs in our sequences. Assume that we wish to search for the TATA-binding site in

our example sequences. First, we need to download the motif file from a database and then

run the program as follows:

wget https://jaspar.genereg.net/api/v1/matrix/MA0108.1.meme

mast -mt 5e-02 \
    -oc mast_chip1 \
    MA0108.1.meme \
    chip1_peaks.fasta

Three output files (mast.html, mast.xml, and mast.txt) will be saved in the “mast_chip1”

directory. You can use “firefox mast_chip1/mast.html” to view the results.
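
Before opening the report, it can help to confirm that MAST actually wrote its output. The following is a minimal sketch, not part of the MEME Suite: the function name check_mast_outputs is hypothetical, and it only checks for the three file names listed above in the directory passed to it.

```shell
# Hypothetical helper (not part of the MEME Suite): verify that the three
# MAST output files named in the text exist and are non-empty.
# The function body runs in a subshell, so its variables stay local.
check_mast_outputs() (
    dir="$1"
    status=0
    for f in mast.html mast.xml mast.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    exit "$status"
)

# Usage, assuming MAST was run with "-oc mast_chip1" as above:
#   check_mast_outputs mast_chip1 && firefox mast_chip1/mast.html
```

The function returns a non-zero status if any file is missing, so it can guard later steps in a script.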

6.4 SUMMARY

Identification of the binding sites of proteins on genomic DNA is critical for understanding

gene regulation, pathways, and the role of specific proteins in gene regulation and their

implications in some diseases. Therefore, ChIP-Seq is used to study epigenetic changes that

affect gene expression and the impact of such changes on diseases. ChIP-Seq is one of the

most effective ways to identify protein-binding sites on genomic DNA. The binding

sites of transcription factors and RNA polymerase II are found in the promoter regions of

genes. In a ChIP experiment, the genomic DNA is cut into fragments. The DNA regions

where the protein of interest binds are precipitated using a specific antibody. The protein

molecules are then removed from the DNA fragments. The isolated DNA fragments are

then sequenced using one of the sequencing techniques. DNA library preparation and

sequencing are similar to those of other sequencing applications. The sequence reads (in

FASTQ files) produced by the sequencer come from the ChIP-enriched DNA and are likely to

contain the binding sites for the protein of interest. The quality control step is carried out to

reduce errors and to trim or remove adaptors and other technical sequences that may

affect the analysis results. The cleaned reads are then aligned to a reference genome to produce

BAM files that contain the alignment information of the ChIP reads. The unaligned,

random, and mitochondrial reads are usually removed from the BAM files to reduce the

computational burden. The peak enrichment regions, where the binding sites are most

likely to be found, are called using one of the peak-calling programs. The peak information

for each sample is saved in a BED file. We used R Bioconductor packages to visualize

the distribution of the peaks and to perform annotation and functional analysis, including

GO and KEGG pathway enrichment. GO and KEGG enrichment analyses provide knowledge-based

biological information. Finally, we used motif discovery programs to identify the motifs

in the promoter regions.
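
As a small illustration of the BED peak files and the mitochondrial/random filtering recapped above, the sketch below uses awk on a toy BED file. It is not the chapter's pipeline: the peak coordinates are made-up data, the helper names drop_unplaced and summarize_peaks are hypothetical, and the contig names (chrM, chrMT, *_random) are assumed. In the chapter the filtering is applied to reads in BAM files, whereas here the same idea is applied to BED intervals only to keep the example self-contained.

```shell
# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)

# Toy BED file (hypothetical data): chrom, start, end, peak name.
printf 'chr1\t100\t300\tpeak_1\n'      >  "$workdir/toy_peaks.bed"
printf 'chr1\t1000\t1500\tpeak_2\n'    >> "$workdir/toy_peaks.bed"
printf 'chr2\t50\t250\tpeak_3\n'       >> "$workdir/toy_peaks.bed"
printf 'chrM\t10\t110\tpeak_4\n'       >> "$workdir/toy_peaks.bed"
printf 'chr1_random\t5\t105\tpeak_5\n' >> "$workdir/toy_peaks.bed"

# Hypothetical helper: drop intervals on the mitochondrial chromosome and
# on "random" contigs (naming assumed: chrM, chrMT, *_random).
drop_unplaced() {
    awk -F'\t' '$1 != "chrM" && $1 != "chrMT" && $1 !~ /_random$/' "$1"
}

# Hypothetical helper: number of peaks and mean peak width per chromosome.
# BED coordinates are 0-based and half-open, so width = end - start.
summarize_peaks() {
    awk -F'\t' '{ n[$1]++; w[$1] += $3 - $2 }
        END { for (c in n) printf "%s\t%d\t%.1f\n", c, n[c], w[c] / n[c] }' "$1" | sort
}

drop_unplaced "$workdir/toy_peaks.bed" > "$workdir/toy_peaks.filtered.bed"
summarize_peaks "$workdir/toy_peaks.filtered.bed"
```

With the toy data, the summary lists chr1 with 2 peaks (mean width 350.0) and chr2 with 1 peak (mean width 200.0). The chrM and chr1_random intervals are dropped by the filter.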